Polish Morphological Guesser Based on a Statistical A Tergo Index
Abstract
We present a direct method for constructing a morphosyntactic guesser for Polish, that is, a program that produces morphosyntactic descriptions for word forms unknown to the morphological analyser. The core of the method is the construction of a statistical a tergo index, in which pseudo-suffixes (endings) extracted by a statistical tree define the morphosyntactic properties of the corresponding word forms. A secondary aim was to investigate to what extent morphological analysis can be developed exclusively on the basis of endings. Experiments in extracting a guesser for a specific text domain are also presented. The method can be applied to any other inflectional language with only minor technical changes.
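To make the idea of an ending-based index concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract describes pseudo-suffixes extracted by a statistical tree, whereas this sketch simply counts tags for every word-final substring and answers with the longest matching ending. The function names `build_suffix_index` and `guess`, the `max_len` cut-off, the toy lexicon, and the tag labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_suffix_index(lexicon, max_len=4):
    """Map each word-final substring (up to max_len characters) to tag counts."""
    index = defaultdict(Counter)
    for form, tag in lexicon:
        for k in range(1, min(max_len, len(form)) + 1):
            index[form[-k:]][tag] += 1
    return index


def guess(index, form, max_len=4):
    """Return the tags seen with the longest known ending of `form`,
    most frequent first; return an empty list if no ending is known."""
    for k in range(min(max_len, len(form)), 0, -1):
        counts = index.get(form[-k:])
        if counts:
            return [tag for tag, _ in counts.most_common()]
    return []


# Toy lexicon (hypothetical): the tags are illustrative labels, not a real tagset.
LEXICON = [
    ("kotami", "subst:pl:inst"),
    ("domami", "subst:pl:inst"),
    ("czytała", "verb:past:sg:f"),
    ("pisała", "verb:past:sg:f"),
    ("nowego", "adj:sg:gen"),
]

index = build_suffix_index(LEXICON)
print(guess(index, "laptopami"))  # matched through the ending "ami"
print(guess(index, "skanowała"))  # matched through the ending "ała"
```

Preferring the longest matching ending is only one possible back-off policy; a statistical approach such as the one the abstract mentions would instead decide from the data which endings are informative enough to keep.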
Similar resources
Towards Czech Morphological Guesser
This paper presents a morphological guesser for Czech based on data from Czech morphological analyzer ajka [1]. The idea behind the presented concept lies in a presumption that the new (and therefore unknown to the analyzer) words in a language behave quite regularly and that a description of this regular behaviour can be extracted from the existing data of the morphological analyzer. The paper...
comparing a statistical and a constraint-based method
In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems an...
Handling Unknown Words in Arabic FST Morphology
A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inc...
Describing Linde’s Dictionary of Polish for Digitalisation Purposes
The present paper describes the attempts at digitalising the so-called Linde’s dictionary of Polish published in 6 volumes between 1807 and 1814 by Samuel Bogumił Linde. We are working on a formal description of the dictionary’s structure, whose purpose will be to allow programmers to design a tool for automatic tagging of the text. The dictionary is multilingual, so performing OCR with good qu...
Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing
Highly inflectional/agglutinative languages like Hungarian typically feature possible word forms in such a magnitude that automatic methods that provide morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyser, which is able to analyse almost all wordforms all...